
    Online Active Inference and Learning

    We present a generalized framework for active inference, the selective acquisition of labels for cases at prediction time in lieu of using the estimated labels of a predictive model. We develop techniques within this framework for classifying in an online setting, for example, classifying the stream of web pages on which online advertisements are being served. Stream applications present novel complications because (i) we do not know at label-acquisition time which instances we will see, and (ii) instances repeat according to some unknown (and possibly skewed) distribution. To address these complications, we combine ideas from decision theory, cost-sensitive learning, and online density estimation, and we introduce a method for online estimation of the utility distribution that allows us to manage the labeling budget over the stream. The resulting model tells us which instances to label so that, by the end of each budget period, the budget is best spent (in expectation). We test the method on streams from a real application. The main results show that (1) our proposed approach to active inference on streams can indeed reduce error costs substantially compared to alternative approaches, and (2) more sophisticated online estimation achieves larger reductions in error. We then discuss the setting of simultaneously conducting active inference and active learning, and we argue, with some supporting evidence, that our expected-utility active inference strategy also selects good examples for learning.
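
    A minimal sketch of the kind of expected-utility labeling decision described above, assuming a binary classifier that outputs P(y=1|x), fixed misclassification costs, and a per-period labeling budget. The class and parameter names, the cost values, and the quantile-based budget heuristic are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def expected_error_cost(p_pos, cost_fp, cost_fn):
    """Expected cost of acting on the model's prediction instead of the true label."""
    predict_pos_cost = (1.0 - p_pos) * cost_fp  # predict positive, true label negative
    predict_neg_cost = p_pos * cost_fn          # predict negative, true label positive
    return min(predict_pos_cost, predict_neg_cost)

class StreamActiveInference:
    """Decide, instance by instance, whether to acquire the true label on a stream."""

    def __init__(self, budget_fraction, cost_fp=1.0, cost_fn=5.0, window=1000):
        self.budget_fraction = budget_fraction  # fraction of the stream we may label
        self.cost_fp = cost_fp
        self.cost_fn = cost_fn
        self.window = window
        self.recent_utilities = []              # rolling sample of observed utilities

    def decide(self, p_pos):
        utility = expected_error_cost(p_pos, self.cost_fp, self.cost_fn)
        self.recent_utilities.append(utility)
        self.recent_utilities = self.recent_utilities[-self.window:]
        # Label only instances whose utility falls in the top budget_fraction of
        # recently observed utilities (a simple stand-in for the paper's online
        # estimation of the utility distribution over the budget period).
        threshold = np.quantile(self.recent_utilities, 1.0 - self.budget_fraction)
        return utility >= threshold
```

    The quantile threshold is what ties label acquisition to the budget: if roughly budget_fraction of instances clear it, the budget is spent by the end of the period, in expectation.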

    Dominant Color Learning by Subject Extraction

    Presented as a research poster at the Women in Machine Learning Workshop (WiML ’12), Lake Tahoe, Nevada, USA. Advances in the digital media industry have resulted in exponential growth in available image data sets. This growth has in turn spurred great interest in methods for acquiring, processing, analyzing, and understanding images in order to produce numerical or symbolic information such as color and texture characteristics. Detecting the dominant color of an object in an image without any prior knowledge of the background model, the object's characteristics, or the scene geometry is a challenging problem. The two major challenges in assigning a dominant color to the image subject are isolating the subject via background subtraction and extracting the dominant color from the approximated subject region. In this work, we combine an estimated subject mask with the image's color histogram to detect the dominant image color.
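
    As a rough illustration of the final step, the sketch below combines a precomputed subject mask with a coarse color histogram to pick the dominant color of the masked region. It assumes the mask has already been produced by some background-subtraction step and that the image is an RGB uint8 array; the function name and bin count are illustrative choices, not taken from the paper.

```python
import numpy as np

def dominant_color(image, subject_mask, bins_per_channel=8):
    """Return the approximate dominant color of the pixels selected by subject_mask.

    image: HxWx3 uint8 array (assumed RGB); subject_mask: HxW boolean array.
    """
    subject_pixels = image[subject_mask]            # shape (N, 3), subject pixels only
    if subject_pixels.size == 0:
        raise ValueError("subject mask selects no pixels")
    # Quantize each channel into coarse bins and build a joint 3-D color histogram.
    bin_width = 256 // bins_per_channel
    quantized = subject_pixels // bin_width         # values in [0, bins_per_channel)
    hist, _ = np.histogramdd(quantized,
                             bins=[bins_per_channel] * 3,
                             range=[(0, bins_per_channel)] * 3)
    # The dominant color is taken as the center of the most populated bin.
    r, g, b = np.unravel_index(np.argmax(hist), hist.shape)
    return tuple(int((c + 0.5) * bin_width) for c in (r, g, b))
```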

    Improving fairness in machine learning systems: What do industry practitioners need?

    The potential for machine learning (ML) systems to amplify social inequities and unfairness is receiving increasing popular and academic attention. A surge of recent work has focused on the development of algorithmic tools to assess and mitigate such unfairness. If these tools are to have a positive impact on industry practice, however, it is crucial that their design be informed by an understanding of real-world needs. Through 35 semi-structured interviews and an anonymous survey of 267 ML practitioners, we conduct the first systematic investigation of commercial product teams' challenges and needs for support in developing fairer ML systems. We identify areas of alignment and disconnect between the challenges faced by industry practitioners and the solutions proposed in the fair ML research literature. Based on these findings, we highlight directions for future ML and HCI research that will better address industry practitioners' needs. Comment: To appear in the 2019 ACM CHI Conference on Human Factors in Computing Systems (CHI 2019).

    Cleaning search results using term distance features

    The presence of Web spam in query results is one of the critical challenges facing search engines today. While search engines try to combat the impact of spam pages on their results, the incentive for spammers to use increasingly sophisticated techniques has never been higher, since the commercial success of a Web page is strongly correlated with the number of views that page receives. This paper describes a term-based technique for spam detection built on a simple new summary data structure, called the Term Distance Histogram, that tries to capture the topical structure of a page. We apply this technique as a post-filtering step to a major search engine. Our experiments show that we are able to detect many of the artificially generated spam pages that remained in the engine's results. Specifically, our method detects many web pages generated by techniques such as dumping, weaving, or phrase stitching [11]: spamming techniques designed to achieve high rankings while still exhibiting many of the individual word frequency (and even bi-gram) properties of natural human text.
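
    To make the idea of a term-distance summary concrete, here is a rough sketch that summarizes a page by a normalized histogram of gaps between repeated occurrences of the same term; a downstream classifier could use such histograms to separate natural text from stitched or woven spam. The construction is an assumption for illustration and may differ in detail from the paper's Term Distance Histogram.

```python
import numpy as np

def term_gap_histogram(tokens, max_gap=50):
    """Histogram of distances between consecutive occurrences of the same term.

    tokens: list of lowercased word tokens for one page; gaps larger than
    max_gap are clipped into the last bin.
    """
    last_seen = {}
    gaps = []
    for position, term in enumerate(tokens):
        if term in last_seen:
            gaps.append(min(position - last_seen[term], max_gap))
        last_seen[term] = position
    hist = np.bincount(np.asarray(gaps, dtype=int), minlength=max_gap + 1).astype(float)
    # Normalize so pages of different lengths are comparable as feature vectors.
    return hist / max(hist.sum(), 1.0)
```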